feat: Recover interrupted dlc protocol on restart #1817

holzeis · 2024-01-15T12:40:22Z

This change introduces a naive mechanism to recover from an interrupted dlc protocol (e.g. lost connection or a restart) by persisting the hash of any received message and resending the last outbound dlc message on any reconnect.

Upon connecting to the peer, the last send dlc message is looked up from the database (last_outbound_dlc_messages) and is naively resend to the counterparty (this is applicable for both sides).
Once the counterparty receives the last message, it will check if it has already processed that message looking up the hash of the dlc message in the dlc_messages table.
If it has already seen the message it will simply ignore it if not, it will continue processing it as it appears it must have missed it the first time, thus continuing an interrupted protocol.

All dlc messages are stored into the dlc_messages table once successfully processed
The last outbound dlc message is stored to the last_outbound_dlc_messages table. Note, that since a dlc message can get quite large we are only keeping the last message around so to not waste space.

In order to keep track of all outbound dlc messages I had to introduce the SendDlcMessage Event, which re-routes the sending over the app / coordinator layer who is then able to store the message firstly into the dlc_messages and last_outbound_dlc_messages before actually sending the message to the counter party. This way we ensure we are always able to resend the last message.

TODO (Not necessarily everything in this PR)

If the protocol got interrupted after processing an Offer, but did not process the accept yet. The protocol will still get stuck. feat: Recover from pending offer #1822
sync_position_to_subchannel should be removed, adapted for a dlc channel.
Upon restarting the app within an incomplete dlc channel state a catchup dialog should be shown.
Testing

Future work / Notes:

I had to provide a custom OnionMessageHandler (TenTenOneOnionMessageHandler) to get an reliable hook on when a peer connected. This seems to be far more reliable than to rely on the websocket connection and should be considered for triggering our rollover and expired position actions.
Since we are using different data sources with independent transactional systems it can happen that we are persisting things into the sled database (dlc), but not into the sqlite database (10101). This should get addressed eventually. e.g. by also storing the dlc data structures into the same sqlite database.

bonomat

Very nice - good work. I think some code duplication can be avoided but otherwise this looks good to me.

coordinator/src/admin.rs

crates/ln-dlc-node/src/node/connection.rs

coordinator/migrations/2024-01-10-105231_add_dlc_messages/up.sql

coordinator/src/bin/coordinator.rs

crates/ln-dlc-node/src/dlc_message.rs

coordinator/src/dlc_handler.rs

crates/ln-dlc-node/src/tests/mod.rs

mobile/native/src/db/dlc_messages.rs

mobile/native/src/dlc_handler.rs

…annels

Extending the rust-dlc storage api to also be able to get non-signed channels.

Publishes an event if a peer connected successfully using the onion message handler hook.

holzeis · 2024-01-16T11:20:06Z

@bonomat I've addressed all your review remarks and pushed them as fixups on top of my changes. Please have another look.

CHANGELOG.md

bonomat

Looks good.

Upon connecting to the peer, the last send dlc message is looked up from the database (`last_outbvound_dlc_messages`) and is naively resend to the counterparty (this is applicable for both sides). Once the counterparty receives the last message, it will check if it has already processed that message looking up the hash of the dlc message in the `dlc_messages` table. If it has alrady seen the message it will simply ignore it if not, it will continue processing it as it appears it must have missed it the first time, thus continuing an interrupted protocol. - All dlc messages are stored into the `dlc_messages` table once successfully processed - The last outbound dlc message is stored to the `last_outbound_dlc_messages` table. Note, that since a dlc message can get quite large we are only keeping the last message around so to not waste space. In order to keep track of all outbound dlc messages I had to introduce the `SendDlcMessage` Event, which re-routes the sending over the app / coordinator layer who is then able to store the message firstly into the `dlc_messages` and `last_outbound_dlc_messages` before actually sending the message to the counter party. This way we ensure we are always able to resend the last message. Note, since we are using different data sources with independet transactional systems it can happen that we are persting things into the sled database (dlc), but not into the sqlite database (10101). This should get addressed eventually. e.g. by also storing the dlc data structures into the same sqlite database.

After e268c62, the order can't be copied anymore.

The node was still using its own implementation for sending dlc message, this was primarily needed because it also had to support `SubChannelMessages`. This change removes all subchannel and onchain protocol handlings from processing incoming dlc messages, as those protocols aren't used. By switching to the common `NodeEvent::SendDlcMessage` we can rely on the outbound message being persisted during sending and can get rid of the `DlcMessageStep`.

Instead of processing the position state on the outbound dlc message. IMHO the logic is easier to understand that way and the code is more cohesive that way. (at least all the code that needs to be refactored is now in one place :)

By default this is set to 2MB, according to the docs[1]. We need larger bodies than before (even in the e2e test!), because we added lots of CET adaptor signatures to the database recently[2]. I also played around with nginx's `client_max_body_size`, but that's actually irrelevant since we don't use nginx as a reverse proxy for the coordinator in the e2e tests. Related to #1719. [1]: https://docs.rs/axum/latest/axum/extract/struct.DefaultBodyLimit.html#method.disable. [2]: #1817.

holzeis requested review from bonomat and luckysori January 15, 2024 12:40

holzeis self-assigned this Jan 15, 2024

holzeis marked this pull request as ready for review January 15, 2024 12:47

holzeis force-pushed the feat/dlc-channel-recovery-on-restart branch from d8694b6 to 5b35de5 Compare January 15, 2024 12:48

holzeis changed the title ~~chore: Adapt dlc channels api~~ chore: Recover interrupted dlc protocol on restart Jan 15, 2024

holzeis force-pushed the feat/dlc-channel-recovery-on-restart branch from 5b35de5 to eca91d1 Compare January 15, 2024 13:14

holzeis changed the title ~~chore: Recover interrupted dlc protocol on restart~~ feat: Recover interrupted dlc protocol on restart Jan 15, 2024

holzeis force-pushed the feat/dlc-channel-recovery-on-restart branch 2 times, most recently from e2d1549 to ce27de9 Compare January 15, 2024 14:13

bonomat reviewed Jan 16, 2024

View reviewed changes

holzeis added 4 commits January 16, 2024 12:16

chore: Adapt dlc channels api to return dlc channel instead of sub ch…

7105fc1

…annels

chore: Fix compile errors in test-e2e crate

6f214c7

chore: Get all channels

c2bb869

Extending the rust-dlc storage api to also be able to get non-signed channels.

feat: Add basic node event handler

207f5c5

Publishes an event if a peer connected successfully using the onion message handler hook.

holzeis force-pushed the feat/dlc-channel-recovery-on-restart branch from 30da175 to 876bdab Compare January 16, 2024 11:16

holzeis requested a review from bonomat January 16, 2024 11:19

bonomat reviewed Jan 16, 2024

View reviewed changes

CHANGELOG.md Outdated Show resolved Hide resolved

bonomat approved these changes Jan 16, 2024

View reviewed changes

holzeis force-pushed the feat/dlc-channel-recovery-on-restart branch from 876bdab to a724da4 Compare January 16, 2024 13:30

holzeis added 4 commits January 16, 2024 14:31

chore: Fix compile error in test subscribers

7bdbe59

After e268c62, the order can't be copied anymore.

holzeis force-pushed the feat/dlc-channel-recovery-on-restart branch from a724da4 to 14a2169 Compare January 16, 2024 13:32

holzeis enabled auto-merge January 16, 2024 13:36

holzeis added this pull request to the merge queue Jan 16, 2024

Merged via the queue into main with commit b4ef2c7 Jan 16, 2024
9 checks passed

holzeis deleted the feat/dlc-channel-recovery-on-restart branch January 16, 2024 14:32

holzeis mentioned this pull request Jan 16, 2024

DLC-channel protocol recovery on restart #1784

Closed

holzeis linked an issue Jan 16, 2024 that may be closed by this pull request

DLC-channel protocol recovery on restart #1784

Closed

luckysori mentioned this pull request Jan 18, 2024

Failed to upload backup: too large #1719

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Recover interrupted dlc protocol on restart #1817

feat: Recover interrupted dlc protocol on restart #1817

holzeis commented Jan 15, 2024 •

edited

Loading

bonomat left a comment

holzeis commented Jan 16, 2024

bonomat left a comment

feat: Recover interrupted dlc protocol on restart #1817

feat: Recover interrupted dlc protocol on restart #1817

Conversation

holzeis commented Jan 15, 2024 • edited Loading

TODO (Not necessarily everything in this PR)

Future work / Notes:

bonomat left a comment

Choose a reason for hiding this comment

holzeis commented Jan 16, 2024

bonomat left a comment

Choose a reason for hiding this comment

holzeis commented Jan 15, 2024 •

edited

Loading